Reference: http://blog.fens.me/r-stringr/
用R语言处理字符串,总觉得很麻烦,即不能用向量的方法进行分割,也不能用循环遍历索引。grep()家族函数常常记不住,paste()函数默认以空格分割,各种不顺手啊!随着使用R语言的场景越来越多,字符串处理是必不可少的。给大家推荐一个由 Hadley Wickham 开发的一个灵活的字符串处理包stringr。
stringr介绍
stringr安装
stringr的API介绍
stringr包被定义为一致的、简单易用的字符串工具集。所有的函数和参数定义都具有一致性,比如,用相同的方法进行NA处理和0长度的向量处理。
字符串处理虽然不是R语言中最主要的功能,却也是必不可少的,数据清洗、可视化等的操作都会用到。对于R语言本身的base包提供的字符串基础函数,随着时间的积累,已经变得很多地方不一致,不规范的命名,不标准的参数定义,很难看一眼就上手使用。字符串处理在其他语言中都是非常方便的事情,R语言在这方面确实落后了。stringr包就是为了解决这个问题,让字符串处理变得简单易用,提供友好的字符串操作接口。
stringr的项目主页:https://cran.r-project.org/web/packages/stringr/index.html
本文所使用的系统环境
Win10 64bit R: 3.2.3 x86_64-w64-mingw32/x64 b4bit
stringr是在CRAN发布的标准库,安装起来非常简单,2条命令就可以了。
# ~ R
# Load stringr package
if(!suppressWarnings(require("stringr")))
{
install.packages("stringr")
require("stringr")
}
# ~ R
# install.packages('stringr')
# library(stringr)
stringr包1.0.0版本,一共提供了30个函数,方便我们对字符串处理。常用的字符串的处理以str_开头来命名,方便更直观理解函数的定义。我们可以根据使用习惯对函数进行分类:
str_c
: 字符串拼接。str_join
: 字符串拼接,同str_c
。str_trim
: 去掉字符串的空格和TAB()str_pad
: 补充字符串的长度str_dup
: 复制字符串str_wrap
: 控制字符串输出格式str_sub
: 截取字符串str_sub
: 截取字符串,并赋值,同str_sub
函数定义-Description & Usage:
str_c(..., sep = "", collapse = NULL)
str_join(..., sep = "", collapse = NULL)
参数列表-Arguments:
…
: 多参数的输入sep
: 把多个字符串拼接为一个大的字符串,用于字符串的分割符。collapse
: 把多个向量参数拼接为一个大的字符串,用于字符串的分割符。把多个字符串拼接为一个大的字符串。Join multiple strings into a single string.
str_c('a','b')
# [1] "ab"
str_c('a','b',sep='-')
# [1] "a-b"
str_c(c('a','a1'),c('b','b1'),sep='-')
# [1] "a-b" "a1-b1"
# More Examples
str_c("Letter: ", letters)
# [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d"
# [5] "Letter: e" "Letter: f" "Letter: g" "Letter: h"
# [9] "Letter: i" "Letter: j" "Letter: k" "Letter: l"
# [13] "Letter: m" "Letter: n" "Letter: o" "Letter: p"
# [17] "Letter: q" "Letter: r" "Letter: s" "Letter: t"
# [21] "Letter: u" "Letter: v" "Letter: w" "Letter: x"
# [25] "Letter: y" "Letter: z"
str_c("Letter", letters, sep = ": ")
# [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d"
# [5] "Letter: e" "Letter: f" "Letter: g" "Letter: h"
# [9] "Letter: i" "Letter: j" "Letter: k" "Letter: l"
# [13] "Letter: m" "Letter: n" "Letter: o" "Letter: p"
# [17] "Letter: q" "Letter: r" "Letter: s" "Letter: t"
# [21] "Letter: u" "Letter: v" "Letter: w" "Letter: x"
# [25] "Letter: y" "Letter: z"
str_c(letters, " is for", "...")
# [1] "a is for..." "b is for..." "c is for..."
# [4] "d is for..." "e is for..." "f is for..."
# [7] "g is for..." "h is for..." "i is for..."
# [10] "j is for..." "k is for..." "l is for..."
# [13] "m is for..." "n is for..." "o is for..."
# [16] "p is for..." "q is for..." "r is for..."
# [19] "s is for..." "t is for..." "u is for..."
# [22] "v is for..." "w is for..." "x is for..."
# [25] "y is for..." "z is for..."
str_c(letters[-26], " comes before ", letters[-1])
# [1] "a comes before b" "b comes before c"
# [3] "c comes before d" "d comes before e"
# [5] "e comes before f" "f comes before g"
# [7] "g comes before h" "h comes before i"
# [9] "i comes before j" "j comes before k"
# [11] "k comes before l" "l comes before m"
# [13] "m comes before n" "n comes before o"
# [15] "o comes before p" "p comes before q"
# [17] "q comes before r" "r comes before s"
# [19] "s comes before t" "t comes before u"
# [21] "u comes before v" "v comes before w"
# [23] "w comes before x" "x comes before y"
# [25] "y comes before z"
把多个向量参数拼接为一个大的字符串。
str_c(head(letters), collapse = "")
# [1] "abcdef"
str_c(head(letters), collapse = ", ")
# [1] "a, b, c, d, e, f"
# collapse参数,对多个字符串无效
str_c('a','b',collapse = "-")
# [1] "ab"
str_c(c('a','a1'),c('b','b1'),collapse='-')
# [1] "ab-a1b1"
拼接有NA值的字符串向量时,NA还是NA
# Missing inputs give missing outputs
str_c(c("a", NA, "b"), "-d")
# [1] "a-d" NA "b-d"
# Use str_replace_NA to display literal NAs:
str_c(str_replace_na(c("a", NA, "b")), "-d")
# [1] "a-d" "NA-d" "b-d"
对比str_c()
函数和paste()
函数之间的不同点。
# 多字符串拼接,默认的sep参数行为不一致
str_c('a','b')
# [1] "ab"
paste('a','b')
# [1] "a b"
# 向量拼接字符串,collapse参数的行为一致
str_c(head(letters), collapse = "")
# [1] "abcdef"
paste(head(letters), collapse = "")
# [1] "abcdef"
#拼接有NA值的字符串向量,对NA的处理行为不一致
str_c(c("a", NA, "b"), "-d")
# [1] "a-d" NA "b-d"
paste(c("a", NA, "b"), "-d")
# [1] "a -d" "NA -d" "b -d"
函数定义-Description & Usage:
str_trim(string, side = c("both", "left", "right"))
参数列表-Arguments:
string
: 字符串,字符串向量。side
: 过滤方式,both
两边都过滤,left
左边过滤,right
右边过滤去掉字符串的空格和TAB(),Trim whitespace from start and end of string.
#只过滤左边的空格
str_trim(" left space\t\n",side='left')
# [1] "left space\t\n"
#只过滤右边的空格
str_trim(" left space\t\n",side='right')
# [1] " left space"
#过滤两边的空格
str_trim(" left space\t\n",side='both')
# [1] "left space"
#过滤两边的空格,default
str_trim("\nno space\n\t")
# [1] "no space"
# More Examples
cat(" String with trailing and leading white space\t")
# String with trailing and leading white space
str_trim(" String with trailing and leading white space\t")
# [1] "String with trailing and leading white space"
cat("\n\nString with trailing and leading white space \n\n")
#
#
# String with trailing and leading white space
str_trim("\n\nString with trailing and leading white space \n\n")
# [1] "String with trailing and leading white space"
函数定义-Description & Usage:
str_pad(string, width, side = c("left", "right", "both"), pad = " ")
参数列表-Arguments:
string
: 字符串,字符串向量。width
: 字符串填充后的长度side
: 填充方向,both
两边都填充,left
左边填充,right
右边填充pad
: 用于填充的字符补充字符串的长度。Pad a string
# 从左边补充空格,直到字符串长度为20
str_pad("Stone_Hou", 20, "left")
# [1] " Stone_Hou"
# 从右边补充空格,直到字符串长度为20
str_pad("Stone_Hou", 20, "right")
# [1] "Stone_Hou "
# 从左右两边各补充空格,直到字符串长度为20
str_pad("Stone_Hou", 20, "both")
# [1] " Stone_Hou "
# 从左右两边各补充x字符,直到字符串长度为20
str_pad("Stone_Hou", 20, "both",'z')
# [1] "zzzzzStone_Houzzzzzz"
# More Examples
rbind(
str_pad("hadley", 30, "left"),
str_pad("hadley", 30, "right"),
str_pad("hadley", 30, "both")
)
# [,1]
# [1,] " hadley"
# [2,] "hadley "
# [3,] " hadley "
# All arguments are vectorised except side
str_pad(c("a", "abc", "abcdef"), 10)
# [1] " a" " abc" " abcdef"
str_pad("a", c(5, 10, 20))
# [1] " a" " a"
# [3] " a"
str_pad("a", 10, pad = c("-", "_", " "))
# [1] "---------a" "_________a" " a"
# Longer strings than width are returned unchanged
str_pad("hadley", 3)
# [1] "hadley"
函数定义-Description & Usage:
str_dup(string, times)
参数列表-Arguments:
string
: 字符串,字符串向量。times
: 复制数量复制一个字符串向量。Duplicate and concatenate strings within a character vector
val <- c("abca4", 123, "cba2")
# 复制2次
str_dup(val, 2)
# [1] "abca4abca4" "123123" "cba2cba2"
# 按位置复制
str_dup(val, 1:3)
# [1] "abca4" "123123" "cba2cba2cba2"
# More Examples
fruit <- c("apple", "pear", "banana");fruit
# [1] "apple" "pear" "banana"
str_dup(fruit, 2)
# [1] "appleapple" "pearpear" "bananabanana"
str_dup(fruit, 1:3)
# [1] "apple" "pearpear"
# [3] "bananabananabanana"
str_c("ba", str_dup("na", 0:5))
# [1] "ba" "bana" "banana"
# [4] "bananana" "banananana" "bananananana"
函数定义-Description & Usage:
str_wrap(string, width = 80, indent = 0, exdent = 0)
参数列表-Arguments:
string
: 字符串,字符串向量。width
: 设置一行所占的宽度。indent
: 段落首行的缩进值exdent
: 段落非首行的缩进值txt <- 'R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。'
# 设置宽度为40个字符
cat(str_wrap(txt, width = 40), "\n")
# R语言作为统计学一门语言,一直在小众领域
# 闪耀着光芒。直到大数据的爆发,R语言变成
# 了一门炙手可热的数据分析的利器。随着越来
# 越多的工程背景的人的加入,R语言的社区在
# 迅速扩大成长。现在已不仅仅是统计领域,教
# 育,银行,电商,互联网….都在使用R语言。
# 设置宽度为60字符,首行缩进2字符
cat(str_wrap(txt, width = 60, indent = 2), "\n")
# R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数
# 据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来
# 越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已
# 不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。
# 设置宽度为10字符,非首行缩进4字符
cat(str_wrap(txt, width = 10, exdent = 4), "\n")
# R语言作为
# 统计学一
# 门语言,
# 一直在小
# 众领域闪
# 耀着光芒。
# 直到大数据
# 的爆发,R
# 语言变成了
# 一门炙手可
# 热的数据分
# 析的利器。
# 随着越来
# 越多的工程
# 背景的人的
# 加入,R语
# 言的社区在
# 迅速扩大成
# 长。现在已
# 不仅仅是统
# 计领域,教
# 育,银行,
# 电商,互联
# 网….都在使
# 用R语言。
# More Examples
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
# one word one row
cat(str_wrap(thanks, width = 0, exdent = 2), "\n")
函数定义-Description & Usage:
str_sub(string, start = 1L, end = -1L)
参数列表-Arguments:
string
: 字符串,字符串向量。start
: 开始位置end
: 结束位置截取字符串。
txt <- "I am Conan."
# 截取1-4的索引位置的字符串
str_sub(txt, 1, 4)
# [1] "I am"
# 截取1-6的索引位置的字符串
str_sub(txt, end=6)
# [1] "I am C"
# 截取6到结束的索引位置的字符串,default to end length(txt)
str_sub(txt, 6)
# [1] "Conan."
# 分2段截取字符串
str_sub(txt, c(1, 4), c(6, 8))
# [1] "I am C" "m Con"
# 通过负坐标截取字符串 Negative indices
str_sub(txt, -3)
# [1] "an."
str_sub(txt, end = -3)
# [1] "I am Cona"
# More Example From str_sub help
hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
# [1] "Hadley"
str_sub(hw, end = 6)
# [1] "Hadley"
str_sub(hw, 8, 14)
# [1] "Wickham"
str_sub(hw, 8)
# [1] "Wickham"
str_sub(hw, c(1, 8), c(6, 14))
# str_sub(hw, c(1, 8), c(6, 14))
# Negative indices
str_sub(hw, -1)
# [1] "m"
str_sub(hw, -7)
# [1] "Wickham"
str_sub(hw, end = -7)
# [1] "Hadley W"
# Alternatively, you can pass in a two colum matrix, as in the
# output from str_locate_all
pos <- str_locate_all(hw, "[aeio]")[[1]]
str_sub(hw, pos)
str_sub(hw, pos[, 1], pos[, 2])
# Vectorisation
str_sub(hw, seq_len(str_length(hw)))
# [1] "Hadley Wickham" "adley Wickham"
# [3] "dley Wickham" "ley Wickham"
# [5] "ey Wickham" "y Wickham"
# [7] " Wickham" "Wickham"
# [9] "ickham" "ckham"
# [11] "kham" "ham"
# [13] "am" "m"
str_sub(hw, end = seq_len(str_length(hw)))
# [1] "H" "Ha"
# [3] "Had" "Hadl"
# [5] "Hadle" "Hadley"
# [7] "Hadley " "Hadley W"
# [9] "Hadley Wi" "Hadley Wic"
# [11] "Hadley Wick" "Hadley Wickh"
# [13] "Hadley Wickha" "Hadley Wickham"
# Replacement form
x <- "BBCDEF"
str_sub(x, 1, 1) <- "A"; x
# [1] "ABCDEF"
str_sub(x, -1, -1) <- "K"; x
# [1] "ABCDEK"
# replace E in ABCDEK with GHIJ
str_sub(x, -2, -2) <- "GHIJ"; x
# [1] "ABCDGHIJK"
# replace BCDGHIJ in ABCDGHIJK with non
str_sub(x, 2, -2) <- ""; x
对截取的字符串进行赋值。Extract and replace substrings from a character vector.
x <- "AAABBBCCC"
# 在字符串的1的位置赋值为1
str_sub(x, 1, 1) <- 1; x
# [1] "1AABBBCCC"
# 在字符串从2到-2的位置赋值为2345
str_sub(x, 2, -2) <- "2345"; x
# [1] "12345C"
str_count
: 字符串计数str_length
: 字符串长度str_sort
: 字符串值排序str_order
: 字符串索引排序,规则同str_sort
函数定义-Description & Usage:
str_count(string, pattern = "")
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配的字符。对字符串中匹配的字符计数
# Count the number of matches in a string.
str_count('aaa444sssddd', "a")
# [1] 3
对字符串向量中匹配的字符计数
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
# [1] 1 3 1 1
str_count(fruit, "p")
# [1] 2 0 1 3
str_count(fruit, "e")
# [1] 1 0 1 2
# matches 1-1
str_count(fruit, c("a", "b", "p", "p"))
# [1] 1 1 1 3
对字符串中的’.’字符计数,由于.是正则表达式的匹配符,直接判断计数的结果是不对的。
str_count(c("a.", ".", ".a.",NA), ".")
# [1] 2 1 3 NA
# 用fixed匹配字符
str_count(c("a.", ".", ".a.",NA), fixed("."))
# [1] 1 1 2 NA
# 用\\匹配字符
str_count(c("a.", ".", ".a.",NA), "\\.")
# [1] 1 1 2 NA
函数定义-Description & Usage:
str_length(string)
参数列表-Arguments:
string
: 字符串,字符串向量。计算字符串的长度,The length of a string :
str_length(c("I", "am", "张丹", NA))
# [1] 1 2 2 NA
# More Examples
str_length(letters)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [26] 1
str_length(NA)
# [1] NA
str_length(factor("abc"))
# [1] 3
str_length(c("i", "like", "programming", NA))
# [1] 1 4 11 NA
# Two ways of representing a u with an umlaut
u1 <- "\u00fc"
u2 <- stringi::stri_trans_nfd(u1)
# The print the same:
u1
u2
# But have a different length
str_length(u1)
# [1] 1
str_length(u2)
# [1] 2
# Even though they have the same number of characters
str_count(u1)
# [1] 1
str_count(u2)
# [1] 1
函数定义-Description & Usage:
str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
参数列表-Arguments:
x
: 字符串,字符串向量。decreasing
: 排序方向。na_last
:NA
值的存放位置,一共3个值,TRUE
放到最后,FALSE
放到最前,NA
过滤处理locale
:按哪种语言习惯排序对字符串值进行排序。Order or sort a character vector.
# 按ASCII字母排序
str_sort(c('a',1,2,'11'), locale = "en")
# [1] "1" "11" "2" "a"
# 倒序排序
str_sort(letters,decreasing=TRUE)
# [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
# [20] "g" "f" "e" "d" "c" "b" "a"
# 按拼音排序
str_sort(c('你','好','粉','丝','日','志'),locale = "zh")
# [1] "粉" "好" "你" "日" "丝" "志"
# More examples
str_order(letters)
str_sort(letters)
str_order(letters, locale = "haw")
str_sort(letters, locale = "haw")
x <- c("100a10", "100a5", "2b", "2a")
str_sort(x)
str_sort(x, numeric = TRUE)
对NA值的排序处理
#把NA放最后面
str_sort(c(NA,'1',NA),na_last=TRUE)
# [1] "1" NA NA
#把NA放最前面
str_sort(c(NA,'1',NA),na_last=FALSE)
# [1] NA NA "1"
#去掉NA值
str_sort(c(NA,'1',NA),na_last=NA)
# [1] "1"
str_split
: 字符串分割str_split_fixed
: 字符串分割,同str_split
str_subset
: 返回匹配的字符串word
: 从文本中提取单词str_detect
: 检查匹配字符串的字符str_match
: 从字符串中提取匹配组。str_match_all
: 从字符串中提取匹配组,同str_match
str_replace
: 字符串替换str_replace_all
: 字符串替换,同str_replace
str_replace_na
:把NA替换为NA字符串str_locate
: 找到匹配的字符串的位置。str_locate_all
: 找到匹配的字符串的位置,同str_locate
str_extract
: 从字符串中提取匹配字符str_extract_all
: 从字符串中提取匹配字符,同str_extract
函数定义-Description & Usage:
str_split(string, pattern, n = Inf)
str_split_fixed(string, pattern, n)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配的字符。n
: 分割个数对字符串进行分割。Split up a string into pieces.
val <- "abc,123,234,iuuu"
# 以,进行分割
s1<-str_split(val, ",");s1
# [[1]]
# [1] "abc" "123" "234" "iuuu"
# 以,进行分割,保留2块
s2<-str_split(val, ",",2);s2
# [[1]]
# [1] "abc" "123,234,iuuu"
# 查看str_split()函数操作的结果类型list
class(s1)
# [1] "list"
# 用str_split_fixed()函数分割,结果类型是matrix
s3<-str_split_fixed(val, ",",2);s3
# [,1] [,2]
# [1,] "abc" "123,234,iuuu"
class(s3)
# [1] "matrix"
# more examples
fruits <- c(
"apples and oranges and pears and bananas",
"pineapples and mangos and guavas"
)
str_split(fruits, " and ")
# [[1]]
# [1] "apples" "oranges" "pears" "bananas"
#
# [[2]]
# [1] "pineapples" "mangos" "guavas"
str_split(fruits, " and ", simplify = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "apples" "oranges" "pears" "bananas"
# [2,] "pineapples" "mangos" "guavas" ""
# Specify n to restrict the number of possible matches
str_split(fruits, " and ", n = 3)
# [[1]]
# [1] "apples" "oranges"
# [3] "pears and bananas"
#
# [[2]]
# [1] "pineapples" "mangos" "guavas"
str_split(fruits, " and ", n = 2)
# [[1]]
# [1] "apples"
# [2] "oranges and pears and bananas"
#
# [[2]]
# [1] "pineapples" "mangos and guavas"
# If n greater than number of pieces, no padding occurs
str_split(fruits, " and ", n = 5)
# [[1]]
# [1] "apples" "oranges" "pears" "bananas"
#
# [[2]]
# [1] "pineapples" "mangos" "guavas"
# Use fixed to return a character matrix
str_split_fixed(fruits, " and ", 3)
# [,1] [,2] [,3]
# [1,] "apples" "oranges" "pears and bananas"
# [2,] "pineapples" "mangos" "guavas"
str_split_fixed(fruits, " and ", 4)
# [,1] [,2] [,3] [,4]
# [1,] "apples" "oranges" "pears" "bananas"
# [2,] "pineapples" "mangos" "guavas" ""
函数定义-Description & Usage:
str_subset(string, pattern)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配的字符。val <- c("abc", 123, "cba")
# 全文匹配
str_subset(val, "a")
# [1] "abc" "cba"
# 开头匹配
str_subset(val, "^a")
# [1] "abc"
# 结尾匹配
str_subset(val, "a$")
# [1] "cba"
函数定义-Description & Usage:
word(string, start = 1L, end = start, sep = fixed(" "))
参数列表-Arguments:
string
: 字符串,字符串向量。start
: 开始位置。end
: 结束位置。sep
: 匹配字符。Extract words from a sentence.
val <- c("I am Conan.", "http://fens.me, ok")
# 默认以空格分割,取第一个位置的字符串
word(val, 1)
# [1] "I" "http://fens.me,"
word(val, -1)
# [1] "Conan." "ok"
word(val, 2, -1)
# [1] "am Conan." "ok"
# 以,分割,取第一个位置的字符串
val<-'111,222,333,444'
word(val, 1, sep = fixed(','))
# [1] "111"
word(val, 3, sep = fixed(','))
# [1] "333"
函数定义-Description & Usage:
str_detect(string, pattern) 参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配字符。Detect the presence or absence of a pattern in a string.
val <- c("abca4", 123, "cba2")
# 检查字符串向量,是否包括a
str_detect(val, "a")
# [1] TRUE FALSE TRUE
# 检查字符串向量,是否以a为开头
str_detect(val, "^a")
# [1] TRUE FALSE FALSE
# 检查字符串向量,是否以a为结尾
str_detect(val, "a$")
# [1] FALSE FALSE FALSE
# More Examples
fruit <- c("apple", "banana", "pear", "pinapple");fruit
# [1] "apple" "banana" "pear" "pinapple"
str_detect(fruit, "a")
# [1] TRUE TRUE TRUE TRUE
str_detect(fruit, "^a")
# [1] TRUE FALSE FALSE FALSE
str_detect(fruit, "a$")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "b")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "[aeiou]")
# Also vectorised over pattern
str_detect("aecfg", letters)
# [1] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
# [9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [25] FALSE FALSE
函数定义-Description & Usage:
str_match(string, pattern)
str_match_all(string, pattern)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配字符。从字符串中提取匹配组Extract matched groups from a string.
val <- c("abc", 123, "cba")
# 匹配字符a,并返回对应的字符
str_match(val, "a")
# [,1]
# [1,] "a"
# [2,] NA
# [3,] "a"
# 匹配字符0-9,限1个,并返回对应的字符
str_match(val, "[0-9]")
# [,1]
# [1,] NA
# [2,] "1"
# [3,] NA
# 匹配字符0-9,不限数量,并返回对应的字符
str_match(val, "[0-9]*")
# [,1]
# [1,] ""
# [2,] "123"
# [3,] ""
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
"387 287 6718", "apple", "233.398.9187 ", "482 952 3315",
"239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000",
"Home: 543.355.3679");strings
# [1] " 219 733 8965"
# [2] "329-293-8753 "
# [3] "banana"
# [4] "595 794 7569"
# [5] "387 287 6718"
# [6] "apple"
# [7] "233.398.9187 "
# [8] "482 952 3315"
# [9] "239 923 8115 and 842 566 4692"
# [10] "Work: 579-499-7527"
# [11] "$1000"
# [12] "Home: 543.355.3679"
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})";phone
str_extract(strings, phone)
# [1] "219 733 8965" "329-293-8753" NA
# [4] "595 794 7569" "387 287 6718" NA
# [7] "233.398.9187" "482 952 3315" "239 923 8115"
# [10] "579-499-7527" NA "543.355.3679"
str_match(strings, phone)
# [,1] [,2] [,3] [,4]
# [1,] "219 733 8965" "219" "733" "8965"
# [2,] "329-293-8753" "329" "293" "8753"
# [3,] NA NA NA NA
# [4,] "595 794 7569" "595" "794" "7569"
# [5,] "387 287 6718" "387" "287" "6718"
# [6,] NA NA NA NA
# [7,] "233.398.9187" "233" "398" "9187"
# [8,] "482 952 3315" "482" "952" "3315"
# [9,] "239 923 8115" "239" "923" "8115"
# [10,] "579-499-7527" "579" "499" "7527"
# [11,] NA NA NA NA
# [12,] "543.355.3679" "543" "355" "3679"
# Extract/match all
str_extract_all(strings, phone)
# [[1]]
# [1] "219 733 8965"
#
# [[2]]
# [1] "329-293-8753"
#
# [[3]]
# character(0)
#
# [[4]]
# [1] "595 794 7569"
#
# [[5]]
# [1] "387 287 6718"
#
# [[6]]
# character(0)
#
# [[7]]
# [1] "233.398.9187"
#
# [[8]]
# [1] "482 952 3315"
#
# [[9]]
# [1] "239 923 8115" "842 566 4692"
#
# [[10]]
# [1] "579-499-7527"
#
# [[11]]
# character(0)
#
# [[12]]
# [1] "543.355.3679"
str_match_all(strings, phone)
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] "219 733 8965" "219" "733" "8965"
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] "329-293-8753" "329" "293" "8753"
#
# [[3]]
# [,1] [,2] [,3] [,4]
#
# [[4]]
# [,1] [,2] [,3] [,4]
# [1,] "595 794 7569" "595" "794" "7569"
#
# [[5]]
# [,1] [,2] [,3] [,4]
# [1,] "387 287 6718" "387" "287" "6718"
#
# [[6]]
# [,1] [,2] [,3] [,4]
#
# [[7]]
# [,1] [,2] [,3] [,4]
# [1,] "233.398.9187" "233" "398" "9187"
#
# [[8]]
# [,1] [,2] [,3] [,4]
# [1,] "482 952 3315" "482" "952" "3315"
#
# [[9]]
# [,1] [,2] [,3] [,4]
# [1,] "239 923 8115" "239" "923" "8115"
# [2,] "842 566 4692" "842" "566" "4692"
#
# [[10]]
# [,1] [,2] [,3] [,4]
# [1,] "579-499-7527" "579" "499" "7527"
#
# [[11]]
# [,1] [,2] [,3] [,4]
#
# [[12]]
# [,1] [,2] [,3] [,4]
# [1,] "543.355.3679" "543" "355" "3679"
从字符串中提取匹配组,以字符串matrix格式返回
str_match_all(val, "a")
# [[1]]
# [,1]
# [1,] "a"
#
# [[2]]
# [,1]
#
# [[3]]
# [,1]
# [1,] "a"
str_match_all(val, "[0-9]")
# [[1]]
# [,1]
#
# [[2]]
# [,1]
# [1,] "1"
# [2,] "2"
# [3,] "3"
#
# [[3]]
# [,1]
x <- c("<a> <b>", "<a> <>", "<a>", "", NA);x
str_match(x, "<(.*?)> <(.*?)>")
# [,1] [,2] [,3]
# [1,] "<a> <b>" "a" "b"
# [2,] "<a> <>" "a" ""
# [3,] NA NA NA
# [4,] NA NA NA
# [5,] NA NA NA
str_match_all(x, "<(.*?)>")
# [[1]]
# [,1] [,2]
# [1,] "<a>" "a"
# [2,] "<b>" "b"
#
# [[2]]
# [,1] [,2]
# [1,] "<a>" "a"
# [2,] "<>" ""
#
# [[3]]
# [,1] [,2]
# [1,] "<a>" "a"
#
# [[4]]
# [,1] [,2]
#
# [[5]]
# [,1] [,2]
# [1,] NA NA
str_extract(x, "<.*?>")
# [1] "<a>" "<a>" "<a>" NA NA
str_extract_all(x, "<.*?>")
# [[1]]
# [1] "<a>" "<b>"
#
# [[2]]
# [1] "<a>" "<>"
#
# [[3]]
# [1] "<a>"
#
# [[4]]
# character(0)
#
# [[5]]
# [1] NA
函数定义-Description & Usage:
str_replace(string, pattern, replacement)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配字符。replacement
: 用于替换的字符。Replace matched patterns in a string.
fruits %>%
str_c(collapse = "---") %>%
str_replace_all(c("one" = "1", "two" = "2", "three" = "3"))
[1] "1 apple---2 pears---3 bananas"
函数定义-Description & Usage:
str_replace_na(string, replacement = "NA")
参数列表-Arguments:
string
: 字符串,字符串向量。replacement
: 用于替换的字符。把NA替换为字符串Turn NA into “NA”
str_replace_na(c(NA,'NA',"abc"),'x')
# [1] "x" "NA" "abc"
str_replace_na(c(NA, "abc", "def"))
# [1] "NA" "abc" "def"
函数定义-Description & Usage:
str_locate(string, pattern)
str_locate_all(string, pattern)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配字符。Locate the position of patterns in a string.
val <- c("abca", 123, "cba")
# 匹配a在字符串中的位置
str_locate(val, "a")
# start end
# [1,] 1 1
# [2,] NA NA
# [3,] 3 3
# 用向量匹配
str_locate(val, c("a", 12, "b"))
# start end
# [1,] 1 1
# [2,] 1 2
# [3,] 2 2
# 以字符串matrix格式返回
str_locate_all(val, "a")
# [[1]]
# start end
# [1,] 1 1
# [2,] 4 4
#
# [[2]]
# start end
#
# [[3]]
# start end
# [1,] 3 3
# 匹配a或b字符,以字符串matrix格式返回
str_locate_all(val, "[ab]")
# [[1]]
# start end
# [1,] 1 1
# [2,] 2 2
# [3,] 4 4
#
# [[2]]
# start end
#
# [[3]]
# start end
# [1,] 2 2
# [2,] 3 3
# More Examples
fruit <- c("apple", "banana", "pear", "pineapple");fruit
# [1] "apple" "banana" "pear" "pineapple"
str_locate(fruit, "$")
# start end
# [1,] 6 5
# [2,] 7 6
# [3,] 5 4
# [4,] 10 9
str_locate(fruit, "a")
# start end
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 5 5
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))
str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))
# Find location of every character
str_locate_all(fruit, "")
函数定义-Description & Usage:
str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)
参数列表-Arguments:
string
: 字符串,字符串向量。pattern
: 匹配字符。simplify
: 返回值,TRUE
返回matrix
,FALSE
返回字符串向量Extract matching patterns from a string.
val <- c("abca4", 123, "cba2")
# 返回匹配的数字
str_extract(val, "\\d")
# [1] "4" "1" "2"
# 返回匹配的字符
str_extract(val, "[a-z]+")
# [1] "abca" NA "cba"
val <- c("abca4", 123, "cba2")
str_extract_all(val, "\\d")
# [[1]]
# [1] "4"
#
# [[2]]
# [1] "1" "2" "3"
#
# [[3]]
# [1] "2"
str_extract_all(val, "[a-z]+")
# [[1]]
# [1] "abca"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "cba"
# More Examples
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
# extract digits
str_extract(shopping_list, "\\d")
# [1] "4" NA NA "2"
# extract character
str_extract(shopping_list, "[a-z]+")
# [1] "apples" "bag" "bag" "milk"
# extract 4 characters
str_extract(shopping_list, "[a-z]{1,4}")
# [1] "appl" "bag" "bag" "milk"
str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
# [1] NA "bag" "bag" "milk"
# Extract all matches
str_extract_all(shopping_list, "[a-z]+")
str_extract_all(shopping_list, "\\b[a-z]+\\b")
str_extract_all(shopping_list, "\\d")
# Simplify results into character matrix
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
str_extract_all(shopping_list, "\\d", simplify = TRUE)
# Extract all words
str_extract_all("This is, suprisingly, a sentence.", boundary("word"))
str_conv
: 字符编码转换str_to_upper
: 字符串转成大写str_to_lower
: 字符串转成小写,规则同str_to_upper
str_to_title
: 字符串转成首字母大写,规则同str_to_upper
函数定义-Description & Usage:
str_conv(string, encoding)
参数列表-Arguments:
string
: 字符串,字符串向量。encoding
: 编码名。对中文进行转码处理。Specify the encoding of a string.
# Example from encoding?stringi::stringi
x <- rawToChar(as.raw(177));x
# [1] "?
str_conv(x, "ISO-8859-2") # Polish "a with ogonek"
# [1] "ą"
str_conv(x, "ISO-8859-1") # Plus-minus
# [1] "±"
# 把中文字符字节化
x <- charToRaw('你好');x
# [1] c4 e3 ba c3
# 默认RStudio系统字符集为UTF-8
str_conv(x, "UTF-8")
# [1] "你好"
# 默认win系统字符集为GBK,GB2312为GBK字集,转码NO正常
str_conv(x, "GBK")
# [1] "浣犲ソ"
str_conv(x, "GB2312")
# input data \xffffffa0 in current source encoding could not be converted to Unicodeinput data \xffffffbd in current source encoding could not be converted to Unicode[1] "浣\032濂\032"
# 转UTF-8失败
str_conv(x, "UTF-8")
# [1] "���"
# Warning messages:
# 1: In stri_conv(string, encoding, "UTF-8") :
# input data \xffffffc4 in current source encoding could not be converted to Unicode
# 2: In stri_conv(string, encoding, "UTF-8") :
# input data \xffffffe3\xffffffba in current source encoding could not be converted to Unicode
# 3: In stri_conv(string, encoding, "UTF-8") :
# input data \xffffffc3 in current source encoding could not be converted to Unicode
把unicode转UTF-8
x1 <- "\u5317\u4eac"
str_conv(x1, "UTF-8")
# [1] "北京"
函数定义-Description & Usage:
str_to_upper(string, locale = "")
str_to_lower(string, locale = "")
str_to_title(string, locale = "")
参数列表-Arguments:
string
: 字符串。locale
:按哪种语言习惯排序字符串大写转换Convert case of a string.:
val <- "I am conan. Welcome to my blog! http://fens.me"
# 全大写
str_to_upper(val)
# [1] "I AM CONAN. WELCOME TO MY BLOG! HTTP://FENS.ME"
# 全小写
str_to_lower(val)
# [1] "i am conan. welcome to my blog! http://fens.me"
# 首字母大写
str_to_title(val)
# [1] "I Am Conan. Welcome To My Blog! Http://Fens.Me"
# More Examples
dog <- "The quick brown dog";dog
str_to_upper(dog)
# [1] "THE QUICK BROWN DOG"
str_to_lower(dog)
# [1] "the quick brown dog"
str_to_title(dog)
# [1] "The Quick Brown Dog"
# Locale matters!
str_to_upper("i") # English
str_to_upper("i", "tr") # Turkish
boundary
: 定义使用边界coll
: 定义字符串标准排序规则。fixed
: 定义用于匹配的字符,包括正则表达式中的转义符regex
: 定义正则表达式字符串在平常的数据处理中经常用过,需要对字符串进行分割、连接、转换等操作,本篇中通过介绍stringr,灵活的字符串处理库,可以有效地提高代码的编写效率。有了好的工具,在用R语言处理字符串就顺手了。
#生成数据
set.seed(123);
n = 5000000;
p = 5;
system.time(x <- matrix(rnorm(n * p), n, p));
x = cbind(1, x);
bet = c(2, rep(1, p));
y = c(x %*% bet) + rnorm(n);
# Garbage Collection
gc();
dat = as.data.frame(x);
rm(x);
gc();
dat$y = y;
rm(y);
gc();
colnames(dat) = c(paste("x", 0:p, sep = ""), "y");
gc();
?gc()